Final Assessment Programming 1¶

By: Jacob Menzinga (357758)

Introduction¶

In the United States, crime rates steadily increased until the 1990s, after which a sharp decrease was observed. Researchers have proposed multiple explanations for this decline. One factor that drew attention was lead exposure, which gave rise to the Lead-Crime hypothesis. Many papers have since been published on the subject. One example is a meta-analysis by Anthony Higney et al. (2022), in which they write: "Our estimates suggest the abatement of lead pollution may be responsible for 7–28% of the fall in homicide in the US"

For the final assessment of Programming 1, I decided to investigate whether such a relation is also visible in the Netherlands. Measuring blood-lead values for the whole population of the Netherlands would be unethical and expensive, so I decided to look at the amount of lead in wastewater instead.

Hypotheses¶

A higher amount of lead in wastewater correlates with a higher incidence of violent crimes.

Data sources¶
  1. Wastewater treatment in the Netherlands
    From this dataset the following features were selected:

    • Onderwerpen -> Aanvoer van afvalwater -> Hoeveelheden:

      • Volume afvalwater in 1000 m3
      • Lood in kg
    • Regios:

      • Provincies
    • Perioden:

      • from 2010 up to and including 2020
  2. Registered crime in the Netherlands
    From this dataset the following features were selected:

    • Onderwerpen -> Geregistreerde misdrijven:

      • Geregistreerde misdrijven per 1000 inw.
    • Soort misdrijf:

      • 111 Diefstal en inbraak met geweld
      • 15 Afpersing en afdreiging
      • 21 Vernieling en beschadiging
      • 221 Openlijke geweldpleging
      • 23 Brandstichting / ontploffing
      • 3 Gewelds- en seksuele misdrijven
      • 7 Vuurwapenmisdrijven
    • Regios:

      • Provincies
    • Perioden:

      • from 2010 up to and including 2020
Reading in the data¶
In [1]:
# Imports
import yaml

import pandas as pd
import numpy as np

from bokeh.io import output_notebook
from bokeh.plotting import figure, show
output_notebook()

import hvplot.pandas
import holoviews as hv
hv.extension('bokeh')

import folium
from folium import plugins

import DS_Module as ds
Loading BokehJS ...
Supporting Functions¶
In [2]:
def check_data(df):
    """
    A function to check any dataframe for:
        - Missing data
        - Datatypes
        - Descriptive statistics
        
    It then prints its findings.

    Args:
        df (pd.DataFrame): Any dataframe.
    """
    missing_data = df.isna().sum()
    if missing_data.values.sum() == 0:
        print('Missing data:')
        print('No missing data :)')
        missing_loc = 'None'
    
    else:
        # missing_data['perc. of data'] = df.isna().sum()/(len(df))*100
        missing_loc = df[df.isnull().any(axis=1)]
        print(f"Missing data per column:\n{missing_data}\n")
        print("The missing data is located in the following rows:")
        print(missing_loc)
    
    
    dtypes = df.dtypes
    print('\nData types:')
    print(dtypes)
    
    describe = df.describe()
    print(f'\nDescription of the dataframe')
    print(describe)
    
Importing and cleaning data¶
In [3]:
with open('config.yaml') as stream:
    config = yaml.safe_load(stream)
    
crime_df = pd.read_csv(config['crime'], delimiter=';')
lead_df = pd.read_csv(config['lead'], delimiter=';')

First I'll have a look at crime_df

In [4]:
crime_df.rename(columns= {'SoortMisdrijf':'Crime',
                         'RegioS':'Province', 'Perioden':'Year',
                         'GeregistreerdeMisdrijvenPer1000Inw_3':'Incidence'},
                inplace=True)
crime_df
Out[4]:
ID Crime Province Year Incidence
0 17712 CRI1110 PV20 2010JJ00 1.0
1 17713 CRI1110 PV20 2011JJ00 0.7
2 17714 CRI1110 PV20 2012JJ00 0.6
3 17715 CRI1110 PV20 2013JJ00 0.5
4 17716 CRI1110 PV20 2014JJ00 0.4
... ... ... ... ... ...
996 417702 CRI7000 PV99 2016JJ00 NaN
997 417703 CRI7000 PV99 2017JJ00 NaN
998 417704 CRI7000 PV99 2018JJ00 NaN
999 417705 CRI7000 PV99 2019JJ00 NaN
1000 417706 CRI7000 PV99 2020JJ00 NaN

1001 rows × 5 columns

In [5]:
check_data(crime_df)
Missing data per column:
ID            0
Crime         0
Province      0
Year          0
Incidence    60
dtype: int64

The missing data is located in the following rows:
          ID    Crime Province      Year Incidence
132    17856  CRI1110   PV99    2010JJ00       NaN
133    17857  CRI1110   PV99    2011JJ00       NaN
134    17858  CRI1110   PV99    2012JJ00       NaN
135    17859  CRI1110   PV99    2013JJ00       NaN
136    17860  CRI1110   PV99    2014JJ00       NaN
137    17861  CRI1110   PV99    2015JJ00       NaN
138    17862  CRI1110   PV99    2016JJ00       NaN
140    17864  CRI1110   PV99    2018JJ00       NaN
141    17865  CRI1110   PV99    2019JJ00       NaN
142    17866  CRI1110   PV99    2020JJ00       NaN
275    82536  CRI1500   PV99    2010JJ00       NaN
277    82538  CRI1500   PV99    2012JJ00       NaN
278    82539  CRI1500   PV99    2013JJ00       NaN
280    82541  CRI1500   PV99    2015JJ00       NaN
281    82542  CRI1500   PV99    2016JJ00       NaN
282    82543  CRI1500   PV99    2017JJ00       NaN
285    82546  CRI1500   PV99    2020JJ00       NaN
418   111936  CRI2100   PV99    2010JJ00       NaN
419   111937  CRI2100   PV99    2011JJ00       NaN
420   111938  CRI2100   PV99    2012JJ00       NaN
421   111939  CRI2100   PV99    2013JJ00       NaN
422   111940  CRI2100   PV99    2014JJ00       NaN
423   111941  CRI2100   PV99    2015JJ00       NaN
424   111942  CRI2100   PV99    2016JJ00       NaN
425   111943  CRI2100   PV99    2017JJ00       NaN
426   111944  CRI2100   PV99    2018JJ00       NaN
427   111945  CRI2100   PV99    2019JJ00       NaN
428   111946  CRI2100   PV99    2020JJ00       NaN
561   147216  CRI2210   PV99    2010JJ00       NaN
562   147217  CRI2210   PV99    2011JJ00       NaN
564   147219  CRI2210   PV99    2013JJ00       NaN
567   147222  CRI2210   PV99    2016JJ00       NaN
568   147223  CRI2210   PV99    2017JJ00       NaN
569   147224  CRI2210   PV99    2018JJ00       NaN
571   147226  CRI2210   PV99    2020JJ00       NaN
705   194257  CRI2300   PV99    2011JJ00       NaN
706   194258  CRI2300   PV99    2012JJ00       NaN
712   194264  CRI2300   PV99    2018JJ00       NaN
847   235416  CRI3000   PV99    2010JJ00       NaN
848   235417  CRI3000   PV99    2011JJ00       NaN
849   235418  CRI3000   PV99    2012JJ00       NaN
850   235419  CRI3000   PV99    2013JJ00       NaN
851   235420  CRI3000   PV99    2014JJ00       NaN
852   235421  CRI3000   PV99    2015JJ00       NaN
853   235422  CRI3000   PV99    2016JJ00       NaN
854   235423  CRI3000   PV99    2017JJ00       NaN
855   235424  CRI3000   PV99    2018JJ00       NaN
856   235425  CRI3000   PV99    2019JJ00       NaN
857   235426  CRI3000   PV99    2020JJ00       NaN
990   417696  CRI7000   PV99    2010JJ00       NaN
991   417697  CRI7000   PV99    2011JJ00       NaN
992   417698  CRI7000   PV99    2012JJ00       NaN
993   417699  CRI7000   PV99    2013JJ00       NaN
994   417700  CRI7000   PV99    2014JJ00       NaN
995   417701  CRI7000   PV99    2015JJ00       NaN
996   417702  CRI7000   PV99    2016JJ00       NaN
997   417703  CRI7000   PV99    2017JJ00       NaN
998   417704  CRI7000   PV99    2018JJ00       NaN
999   417705  CRI7000   PV99    2019JJ00       NaN
1000  417706  CRI7000   PV99    2020JJ00       NaN

Data types:
ID            int64
Crime        object
Province     object
Year         object
Incidence    object
dtype: object

Description of the dataframe
                  ID
count    1001.000000
mean   172349.000000
std    120100.686889
min     17712.000000
25%     82508.000000
50%    147149.000000
75%    235310.000000
max    417706.000000

Two things I took away from the data check:

1) There are a lot of missing values in the PV99 region. I looked this region code up in the metadata file of the crime dataset (also downloadable from the above link); it is a category for 'uncategorisable' data, so I will drop these rows.

2) I want to cast the Year and Incidence columns to int and float dtypes respectively

In [6]:
# Dropping the PV99 region
crime_df = crime_df[crime_df['Province'] != 'PV99  ']
crime_df
# This seems to have messed up the index a bit, 
# considering there are 924 rows in the df but the index goes up to 989.
Out[6]:
ID Crime Province Year Incidence
0 17712 CRI1110 PV20 2010JJ00 1.0
1 17713 CRI1110 PV20 2011JJ00 0.7
2 17714 CRI1110 PV20 2012JJ00 0.6
3 17715 CRI1110 PV20 2013JJ00 0.5
4 17716 CRI1110 PV20 2014JJ00 0.4
... ... ... ... ... ...
985 417690 CRI7000 PV31 2016JJ00 0.3
986 417691 CRI7000 PV31 2017JJ00 0.3
987 417692 CRI7000 PV31 2018JJ00 0.4
988 417693 CRI7000 PV31 2019JJ00 0.5
989 417694 CRI7000 PV31 2020JJ00 0.5

924 rows × 5 columns

In [7]:
crime_df = crime_df.reset_index().drop('index', axis=1)
In [8]:
# Checking the values in the Incidence and Year columns
print(crime_df['Incidence'].unique())
print(crime_df['Year'].unique())
['     1.0' '     0.7' '     0.6' '     0.5' '     0.4' '     0.3'
 '     0.2' '       .' '     1.1' '     1.2' '     0.9' '     0.8'
 '     1.6' '     1.5' '     1.3' '     0.1' '     0.0' '     9.5'
 '     9.1' '     8.5' '     7.4' '     6.6' '     5.9' '     5.7'
 '     4.7' '     5.1' '     4.9' '     8.8' '     7.8' '     6.7'
 '     6.1' '     5.6' '     4.4' '     4.1' '     4.3' '     4.6'
 '     8.6' '     8.4' '     7.0' '     5.5' '     3.7' '     4.0'
 '     7.9' '     7.6' '     6.5' '    10.6' '    10.3' '     8.2'
 '     7.1' '     5.8' '     5.0' '     4.5' '     4.8' '     8.9'
 '     8.1' '     7.3' '     6.3' '     5.3' '     4.2' '     3.6'
 '     8.0' '     9.0' '     9.8' '     7.5' '     6.8' '    10.2'
 '     6.4' '     6.9' '     3.8' '     9.3' '    10.1' '     7.7'
 '     7.2' '     6.2' '     5.2' '     5.4' '     6.0' '     3.4']
['2010JJ00' '2011JJ00' '2012JJ00' '2013JJ00' '2014JJ00' '2015JJ00'
 '2016JJ00' '2017JJ00' '2018JJ00' '2019JJ00' '2020JJ00']

I'm assuming the '       .' values for incidence should be 0. To check this, I want to plot the 5 datapoints before these values to see if they show a trend decreasing toward 0.

In [9]:
missing_crime = crime_df[crime_df['Incidence']=='       .'].index
missing_crime = np.array(missing_crime)

trend_ind = np.empty(0)

for i in range(6):
    trend_ind = np.append(trend_ind, missing_crime-i)

# Cast to int, since .iloc does not accept float indices
trend_ind = np.unique(trend_ind).astype(int)

# I will use this list after replacing the '.' values with 0 to check if 
# my assumption was correct
In [10]:
# replacing the '       .' value with 0
crime_df['Incidence'] = crime_df['Incidence'].str.replace('       .', '0', regex=False)

# Typecasting the Year and Incidence columns
crime_df['Year'] = crime_df['Year'].str.replace('JJ00','').astype(int)
crime_df['Incidence'] = crime_df['Incidence'].astype(float)
In [11]:
# Plotting the incidence, per region and crime. 0 values are interpolated.
hv.output(widget_location='top_left')
crime_df.iloc[trend_ind
             ].sort_values(by='Year').hvplot.line(x='Year', 
                                                  y='Incidence', 
                                                  groupby=['Crime', 'Province'])
Out[11]:
In [12]:
# Now that that's done, I'll run check_data again to see if I got rid of all the missing data
check_data(crime_df)
Missing data:
No missing data :)

Data types:
ID             int64
Crime         object
Province      object
Year           int32
Incidence    float64
dtype: object

Description of the dataframe
                  ID        Year   Incidence
count     924.000000   924.00000  924.000000
mean   172343.000000  2015.00000    1.885498
std    120105.690147     3.16399    2.661858
min     17712.000000  2010.00000    0.000000
25%     82499.500000  2012.00000    0.200000
50%    147143.000000  2015.00000    0.400000
75%    235306.500000  2018.00000    4.125000
max    417694.000000  2020.00000   10.600000

The crime_df has the different types of crime in one column; I would like each crime as a separate feature with the incidence as its value

In [13]:
crime_df = crime_df.set_index(['Province','Year']).pivot(columns='Crime', values='Incidence').reset_index()
crime_df
Out[13]:
Crime Province Year CRI1110 CRI1500 CRI2100 CRI2210 CRI2300 CRI3000 CRI7000
0 PV20 2010 1.0 0.0 9.5 0.5 0.6 8.6 0.3
1 PV20 2011 0.7 0.0 9.1 0.4 0.6 8.0 0.4
2 PV20 2012 0.6 0.0 8.5 0.5 0.5 7.7 0.3
3 PV20 2013 0.5 0.0 7.4 0.3 0.3 7.2 0.3
4 PV20 2014 0.4 0.0 6.6 0.2 0.3 6.6 0.2
... ... ... ... ... ... ... ... ... ...
127 PV31 2016 0.4 0.1 5.6 0.2 0.4 5.0 0.3
128 PV31 2017 0.3 0.0 4.9 0.2 0.4 4.7 0.3
129 PV31 2018 0.3 0.0 4.3 0.2 0.3 4.6 0.4
130 PV31 2019 0.4 0.1 4.8 0.2 0.3 4.6 0.5
131 PV31 2020 0.3 0.1 4.6 0.1 0.3 4.1 0.5

132 rows × 9 columns

A total incidence would also be useful

In [14]:
crime_df['Total_incidence'] = crime_df[[
    'CRI1110', 'CRI1500', 'CRI2100', 'CRI2210', 'CRI2300', 'CRI3000', 'CRI7000']
                                       ].sum(axis=1)
crime_df
Out[14]:
Crime Province Year CRI1110 CRI1500 CRI2100 CRI2210 CRI2300 CRI3000 CRI7000 Total_incidence
0 PV20 2010 1.0 0.0 9.5 0.5 0.6 8.6 0.3 20.5
1 PV20 2011 0.7 0.0 9.1 0.4 0.6 8.0 0.4 19.2
2 PV20 2012 0.6 0.0 8.5 0.5 0.5 7.7 0.3 18.1
3 PV20 2013 0.5 0.0 7.4 0.3 0.3 7.2 0.3 16.0
4 PV20 2014 0.4 0.0 6.6 0.2 0.3 6.6 0.2 14.3
... ... ... ... ... ... ... ... ... ... ...
127 PV31 2016 0.4 0.1 5.6 0.2 0.4 5.0 0.3 12.0
128 PV31 2017 0.3 0.0 4.9 0.2 0.4 4.7 0.3 10.8
129 PV31 2018 0.3 0.0 4.3 0.2 0.3 4.6 0.4 10.1
130 PV31 2019 0.4 0.1 4.8 0.2 0.3 4.6 0.5 10.9
131 PV31 2020 0.3 0.1 4.6 0.1 0.3 4.1 0.5 10.0

132 rows × 10 columns

Now it's time for lead_df

In [15]:
lead_df.rename(columns={'RegioS':'Province', 'Perioden':'Year',
                        'VolumeAfvalwater_43':'Vol_Wastewater', 
                        'Lood_52':'Lead'}, inplace= True)
lead_df
Out[15]:
ID Province Year Vol_Wastewater Lead
0 187 PV20 2010JJ00 74963 1619
1 188 PV20 2011JJ00 72217 1202
2 189 PV20 2012JJ00 72487 1485
3 190 PV20 2013JJ00 67297 1057
4 191 PV20 2014JJ00 64758 931
... ... ... ... ... ...
127 556 PV31 2016JJ00 158787 3327
128 557 PV31 2017JJ00 148311 .
129 558 PV31 2018JJ00 131743 4911
130 559 PV31 2019JJ00 143745 .
131 560 PV31 2020JJ00 148116 1754

132 rows × 5 columns

In [16]:
# First some general information
check_data(lead_df)
Missing data:
No missing data :)

Data types:
ID                 int64
Province          object
Year              object
Vol_Wastewater     int64
Lead              object
dtype: object

Description of the dataframe
               ID  Vol_Wastewater
count  132.000000      132.000000
mean   373.500000   158719.393939
std    114.395757   117111.886323
min    187.000000    26873.000000
25%    280.250000    64524.000000
50%    373.500000   118650.500000
75%    466.750000   237690.000000
max    560.000000   430470.000000

In this dataframe I want to change the Year and Lead columns to integers

In [17]:
print(lead_df['Lead'].unique())
['    1619' '    1202' '    1485' '    1057' '     931' '    1629'
 '    1888' '       .' '    1299' '     838' '    1314' '    1671'
 '    1432' '    1354' '    1250' '    1268' '    1130' '    1197'
 '     828' '     810' '    1077' '     964' '    1024' '    1089'
 '    1105' '     900' '     918' '    2044' '    1958' '    1730'
 '    1842' '    2588' '    3006' '    2689' '    2355' '    2440'
 '     154' '     129' '     345' '     592' '     517' '     204'
 '     375' '     245' '     183' '    4149' '    3962' '    4203'
 '    4094' '    5572' '    2938' '    2791' '    2943' '    3168'
 '    2258' '    2625' '    1971' '    2289' '    1554' '    1937'
 '    1909' '    1756' '    1668' '    5043' '    5776' '    4752'
 '    7349' '    5200' '    5666' '    5091' '    5283' '    5318'
 '    9377' '   10010' '    8159' '    7209' '    8643' '    7361'
 '    7759' '    9075' '    6379' '    1283' '     667' '    1255'
 '     823' '    1045' '     825' '     744' '     668' '     657'
 '    5688' '    7995' '    6981' '    4731' '    5345' '    4684'
 '    7248' '    5748' '    3751' '    2821' '    3635' '    2676'
 '    3970' '    3025' '    5182' '    3327' '    4911' '    1754']

This dataset also contains '       .' values. A value of 0 kg of lead in the wastewater does not make sense to me, given that all the other values are above 100 and Vol_Wastewater is not decreased in these rows.

I'm going to interpolate these values and plot them alongside the neighbouring measured values to see if the interpolation makes sense

In [18]:
missing_lead = lead_df[lead_df['Lead']=='       .'].index
missing_lead = np.array(missing_lead)

lead_plus_min = np.append(missing_lead-1, missing_lead+1)
lead_plus_min
Out[18]:
array([  6,   8,  17,  19,  28,  30,  39,  41,  50,  52,  61,  63,  72,
        74,  83,  85,  94,  96, 105, 107, 116, 118, 127, 129,   8,  10,
        19,  21,  30,  32,  41,  43,  52,  54,  63,  65,  74,  76,  85,
        87,  96,  98, 107, 109, 118, 120, 129, 131], dtype=int64)
In [19]:
#  replacing the '       .' value with NaN
lead_df['Lead'] = lead_df['Lead'].replace('       .', np.nan, regex=False)

# Typecasting the Year and Lead columns
lead_df['Year'] = lead_df['Year'].str.replace('JJ00','').astype(int)
lead_df['Lead'] = lead_df['Lead'].astype(float) # Float for now because NaN can't be int.

# Filling the NaN with an interpolated value
lead_df['Lead'] = lead_df['Lead'].interpolate().astype(int)
In [20]:
# Now to check if the interpolated data make sense

measured_lead_scatter = lead_df.iloc[lead_plus_min
                                     ].hvplot.scatter(x='Vol_Wastewater',
                                                      y='Lead',
                                                      label='Experimental data')

interpolated_lead_scatter = lead_df.iloc[missing_lead
                                     ].hvplot.scatter(x='Vol_Wastewater',
                                                      y='Lead', color='red',
                                                      label='Interpolated data')
                                     
factcheck = measured_lead_scatter * interpolated_lead_scatter
factcheck.opts(title='Measured and interpolated lead data',
               ylabel='Lead (kg)', xlabel='Volume influent wastewater (1000 m3)',
               legend_position='top_left')
Out[20]:

Now that we have the amount of wastewater in 1000 m3 and the amount of lead in the water in kg, I would like to create a column with the amount of lead per m3 of water

In [21]:
lead_df['lead_per_m3'] = lead_df['Lead'] / lead_df['Vol_Wastewater']
# Lead is in kg and Vol_Wastewater in units of 1000 m3, so the ratio is in g/m3;
# multiplying by 1000 gives lead in mg per m3
lead_df['lead_per_m3'] = lead_df['lead_per_m3']*1000
lead_df
Out[21]:
ID Province Year Vol_Wastewater Lead lead_per_m3
0 187 PV20 2010 74963 1619 21.597321
1 188 PV20 2011 72217 1202 16.644280
2 189 PV20 2012 72487 1485 20.486432
3 190 PV20 2013 67297 1057 15.706495
4 191 PV20 2014 64758 931 14.376602
... ... ... ... ... ... ...
127 556 PV31 2016 158787 3327 20.952597
128 557 PV31 2017 148311 4119 27.772721
129 558 PV31 2018 131743 4911 37.277123
130 559 PV31 2019 143745 3332 23.179937
131 560 PV31 2020 148116 1754 11.842070

132 rows × 6 columns
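As a sanity check on the units (a sketch using the first row's values from the table above): Lead is reported in kg and Vol_Wastewater in units of 1000 m3, so lead_per_m3 comes out in milligrams per cubic metre.

```python
# First row of the table above: Groningen, 2010
lead_kg = 1619        # 'Lead' column, in kg
vol_1000m3 = 74963    # 'Vol_Wastewater' column, in units of 1000 m3

# kg per 1000 m3 is numerically equal to g per m3 ...
g_per_m3 = lead_kg / vol_1000m3
# ... so multiplying by 1000 gives mg per m3
mg_per_m3 = g_per_m3 * 1000

print(round(mg_per_m3, 6))  # matches the 21.597321 shown in the table
```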

In [22]:
# Let's check the data again now that we're done cleaning and interpolating values.
check_data(lead_df)
Missing data:
No missing data :)

Data types:
ID                  int64
Province           object
Year                int32
Vol_Wastewater      int64
Lead                int32
lead_per_m3       float64
dtype: object

Description of the dataframe
               ID         Year  Vol_Wastewater          Lead  lead_per_m3
count  132.000000   132.000000      132.000000    132.000000   132.000000
mean   373.500000  2015.000000   158719.393939   2966.060606    17.756022
std    114.395757     3.174324   117111.886323   2414.776220     5.011926
min    187.000000  2010.000000    26873.000000    129.000000     4.798750
25%    280.250000  2012.000000    64524.000000   1086.000000    14.341746
50%    373.500000  2015.000000   118650.500000   1964.500000    17.531057
75%    466.750000  2018.000000   237690.000000   4735.500000    20.844686
max    560.000000  2020.000000   430470.000000  10010.000000    37.277123

Now I'm going to have a look at the Province column in both DataFrames, since this is the feature I'll be merging on

In [23]:
print(f"""
Crime regions:
{crime_df['Province'].unique()}

Lead regions:
{lead_df['Province'].unique()}""")
Crime regions:
['PV20  ' 'PV21  ' 'PV22  ' 'PV23  ' 'PV24  ' 'PV25  ' 'PV26  ' 'PV27  '
 'PV28  ' 'PV29  ' 'PV30  ' 'PV31  ']

Lead regions:
['PV20           ' 'PV21           ' 'PV22           ' 'PV23           '
 'PV24           ' 'PV25           ' 'PV26           ' 'PV27           '
 'PV28           ' 'PV29           ' 'PV30           ' 'PV31           ']

There clearly is some whitespace that needs removing.

In [24]:
crime_df['Province'] = crime_df['Province'].str.replace(r'\s','', regex=True)
lead_df['Province'] = lead_df['Province'].str.replace(r'\s','', regex=True)
In [25]:
print(f"""
Crime regions:
{crime_df['Province'].unique()}

Lead regions:
{lead_df['Province'].unique()}""")
Crime regions:
['PV20' 'PV21' 'PV22' 'PV23' 'PV24' 'PV25' 'PV26' 'PV27' 'PV28' 'PV29'
 'PV30' 'PV31']

Lead regions:
['PV20' 'PV21' 'PV22' 'PV23' 'PV24' 'PV25' 'PV26' 'PV27' 'PV28' 'PV29'
 'PV30' 'PV31']

Now both dataframes are ready to be merged!

In [26]:
lead_crime_df = lead_df.merge(crime_df, how='inner', on=['Province', 'Year'], 
                              copy=True)
lead_crime_df = lead_crime_df.drop(['Vol_Wastewater', 'Lead', 'ID'], axis=1)
lead_crime_df['Province'] = lead_crime_df['Province'].map({'PV20': 'Groningen',
                                                           'PV21': 'Fryslân',
                                                           'PV22': 'Drenthe',
                                                           'PV23': 'Overijssel',
                                                           'PV24': 'Flevoland',
                                                           'PV25': 'Gelderland',
                                                           'PV26': 'Utrecht',
                                                           'PV27': 'Noord-Holland',
                                                           'PV28': 'Zuid-Holland',
                                                           'PV29': 'Zeeland',
                                                           'PV30': 'Noord-Brabant',
                                                           'PV31': 'Limburg'})

lead_crime_df.rename(columns={'CRI1110':'Violent theft and burglary',
                              'CRI1500':'Extortion',
                              'CRI2100':'Vandalism',
                              'CRI2210':'Public violence',
                              'CRI2300':'Arson / Explosion',
                              'CRI3000':'Violent an sexual crimes',
                              'CRI7000':'Firearms crimes'},
                     inplace=True)

lead_crime_df
Out[26]:
Province Year lead_per_m3 Violent theft and burglary Extortion Vandalism Public violence Arson / Explosion Violent an sexual crimes Firearms crimes Total_incidence
0 Groningen 2010 21.597321 1.0 0.0 9.5 0.5 0.6 8.6 0.3 20.5
1 Groningen 2011 16.644280 0.7 0.0 9.1 0.4 0.6 8.0 0.4 19.2
2 Groningen 2012 20.486432 0.6 0.0 8.5 0.5 0.5 7.7 0.3 18.1
3 Groningen 2013 15.706495 0.5 0.0 7.4 0.3 0.3 7.2 0.3 16.0
4 Groningen 2014 14.376602 0.4 0.0 6.6 0.2 0.3 6.6 0.2 14.3
... ... ... ... ... ... ... ... ... ... ... ...
127 Limburg 2016 20.952597 0.4 0.1 5.6 0.2 0.4 5.0 0.3 12.0
128 Limburg 2017 27.772721 0.3 0.0 4.9 0.2 0.4 4.7 0.3 10.8
129 Limburg 2018 37.277123 0.3 0.0 4.3 0.2 0.3 4.6 0.4 10.1
130 Limburg 2019 23.179937 0.4 0.1 4.8 0.2 0.3 4.6 0.5 10.9
131 Limburg 2020 11.842070 0.3 0.1 4.6 0.1 0.3 4.1 0.5 10.0

132 rows × 11 columns

Above is the dataframe I'll be working with to answer my research question. Below is some information on its columns

Columns:

  • lead_per_m3: the amount of lead (in milligrams) per cubic metre of incoming wastewater
  • Total_incidence: the incidence of violent crimes per 1,000 inhabitants of a province
In [27]:
print("CRIME DATA")
ds.DS_Q_Q_Plot(crime_df['Total_incidence'])
ds.DS_Q_Q_Hist(crime_df['Total_incidence'])
CRIME DATA
Estimation method: robust
n = 132, mu = 12.8, sigma = 3.855
Expected number of data outside CI: 7
Estimation method: robust
mu = 12.8, sigma = 3.855
In [28]:
print("LEAD DATA")
ds.DS_Q_Q_Plot(lead_df['lead_per_m3'])
ds.DS_Q_Q_Hist(lead_df['lead_per_m3'])
LEAD DATA
Estimation method: robust
n = 132, mu = 17.53, sigma = 4.821
Expected number of data outside CI: 7
Estimation method: robust
mu = 17.53, sigma = 4.821

Looking at the histogram of the total crime incidence values, it's clear this data is not normally distributed. I'll take this into account when calculating a correlation between lead and crime.
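As a numerical complement to the Q-Q plots (a sketch, not part of the original analysis; a synthetic right-skewed sample stands in for the real Total_incidence column), the Shapiro-Wilk test from scipy can quantify the departure from normality:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Synthetic right-skewed sample, shaped like the 132 incidence values;
# the real Total_incidence column would be passed in its place
skewed_sample = rng.exponential(scale=2.0, size=132)

stat, p = shapiro(skewed_sample)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.2e}")
# A p value below 0.05 rejects normality, supporting a rank-based
# correlation test such as Spearman's
```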

To see how lead and total incidence behave together, I will plot them on a line plot over time. To account for the difference in y-scales, I'll first normalize the data.

In [29]:
normalized_lead_crime = lead_crime_df[['Province','Year',
                                       'lead_per_m3','Total_incidence']].copy()

for col in ['lead_per_m3', 'Total_incidence']:

    col_min = normalized_lead_crime[col].min()
    col_range = normalized_lead_crime[col].max() - col_min

    # Vectorised min-max scaling to the [0, 1] range
    normalized_lead_crime[col] = (normalized_lead_crime[col] - col_min) / col_range

normalized_lead_crime
Out[29]:
Province Year lead_per_m3 Total_incidence
0 Groningen 2010 0.517223 0.976378
1 Groningen 2011 0.364721 0.874016
2 Groningen 2012 0.483019 0.787402
3 Groningen 2013 0.335846 0.622047
4 Groningen 2014 0.294899 0.488189
... ... ... ... ...
127 Limburg 2016 0.497372 0.307087
128 Limburg 2017 0.707362 0.212598
129 Limburg 2018 1.000000 0.157480
130 Limburg 2019 0.565952 0.220472
131 Limburg 2020 0.216862 0.149606

132 rows × 4 columns

In [30]:
normalized_lead_crime.hvplot(kind='line', x='Year', 
                     y=['lead_per_m3','Total_incidence'], 
                     groupby='Province').opts(legend_position='top_right')
Out[30]:

Two observations here:

  • In all provinces the incidence of violent crimes decreases over time (hooray).
  • The lead content in influent wastewater jumps around over time in most provinces, sometimes ending higher, sometimes lower. Not very hopeful for a strong correlation.
In [31]:
"""
Out of curiousity, I am going to have a look at the incidence of different 
crimes per province, over time. I've set a fixed y-axis so the decrese of 
incidence can be visually observed. 
"""

Crimes = ['Violent theft and burglary','Extortion', 'Vandalism', 
          'Public violence', 'Arson / Explosion', 'Violent an sexual crimes', 
          'Firearms crimes']

hv.output(widget_location='top_left')
lead_crime_df.set_index('Province'
                        ).hvplot(kind='bar', stacked=True,
                                 y=Crimes, groupby='Year',
                                 ylim=(0,25), title='Crimes per region, per year.',
                                 ylabel='Incidence per 1,000 inhabitants',
                                 rot=40)
Out[31]:

To answer my research question (is there a correlation between lead in water and violent crimes?) I'm going to perform a Spearman correlation test. I selected the Spearman test over other correlation tests because it allows for non-normally distributed data.

In [32]:
from scipy.stats import spearmanr
correlation, p_val = spearmanr(lead_crime_df['lead_per_m3'], 
                              lead_crime_df['Total_incidence'])
print(f"""
The correlation between the lead per cubic meter of water and
total incidence of violent crimes is {correlation:.2f}, p value: {p_val}.
""")
The correlation between the lead per cubic meter of water and
total incidence of violent crimes is 0.23, p value: 0.006989339126132411.

In [33]:
scatter_data = hv.Scatter(data=lead_crime_df, kdims='lead_per_m3', 
                          vdims='Total_incidence', label='Regression line')
scatter = lead_crime_df.hvplot.scatter(
                            x='lead_per_m3',
                            y='Total_incidence',
                            xlabel='Lead per m3 of water (in milligrams)',
                            ylabel='Total incidence per 1,000 inhabitants',
                            label='Data',
                            title='Lead in waste water over incidence of violent crime')

Regression_line = hv.Slope.from_scatter(scatter_data).opts(color='black')

print("The black line is the regression line")

scatter.opts(legend_position='top_right') * Regression_line
The black line is the regression line
Out[33]:
In [34]:
# In the above graph we see four datapoints that seem a bit out of the ordinary. 
# I'm going to drop them to see if this has any impact.

to_drop = lead_crime_df[(lead_crime_df['lead_per_m3'] < 6) &
                        (lead_crime_df['Total_incidence'] > 18)
                        ].index.tolist()

to_drop.extend(lead_crime_df[(lead_crime_df['lead_per_m3'] > 35) & 
                             (lead_crime_df['Total_incidence'] < 14)
                             ].index.tolist())
to_drop

lead_crime_df.drop(to_drop, inplace=True)
In [35]:
correlation, p_val = spearmanr(lead_crime_df['lead_per_m3'], 
                              lead_crime_df['Total_incidence'])
print(f"""
The correlation between the lead per cubic meter of water and
total incidence of violent crimes is {correlation:.2f}, p value: {p_val}.
""")
The correlation between the lead per cubic meter of water and
total incidence of violent crimes is 0.30, p value: 0.0005395582436530309.

Removing these four datapoints does impact the correlation somewhat (from 0.23 to 0.30). More interestingly, it decreases the p value further (from roughly 0.007 to 0.0005); both are well below 0.05, indicating significance.

In [36]:
scatter_data = hv.Scatter(data=lead_crime_df, kdims='lead_per_m3', 
                          vdims='Total_incidence')

scatter = lead_crime_df.hvplot.scatter(
                            x='lead_per_m3',
                            y='Total_incidence',
                            xlabel='Lead per m3 of water (in milligrams)',
                            ylabel='Total incidence per 1,000 inhabitants',
                            label='Data',
                            title='Lead in waste water over incidence of violent crime, outliers excluded')

Regression_line2 = hv.Slope.from_scatter(scatter_data).opts(color='red')

print("The black line is the regression line with the outlires included")
print("The red line is the regression line with the outlires excluded")

(scatter * Regression_line
 * Regression_line2).opts(legend_position='top_right') 
The black line is the regression line with the outliers included
The red line is the regression line with the outliers excluded
Out[36]:
In [37]:
# This is confirmed visually.

So, looking at the total incidence and lead per m3 of water, there is a weak but significant correlation. Next I'm going to look at the data per province to see if there are trends visible on a smaller scale

In [38]:
# .corr() returns a 2x2 correlation matrix per province; taking every other row
# ([::2]) keeps the lead_per_m3 rows, whose 'Total_incidence' column holds
# the correlation between lead and crime
corr_lead_crime_province = lead_crime_df[['lead_per_m3','Total_incidence','Province']
                                        ].groupby('Province'
                                        ).corr(method='spearman').reset_index(
                                            )[::2][['Province','Total_incidence']]

corr_lead_crime_province.rename(columns={'Total_incidence':'Correlation lead & crime'}, 
                                inplace=True)

corr_lead_crime_province.set_index('Province').hvplot.bar(rot=40,
                                                         xlabel='Province of the Netherlands')
Out[38]:
In [39]:
# Stratify per crime and / or Province
scatter_data = hv.Scatter(data=lead_crime_df, kdims='lead_per_m3', 
                          vdims='Total_incidence', label='Regression line')

scatter = lead_crime_df.hvplot.scatter(
                            x='lead_per_m3',
                            y='Total_incidence',
                            groupby='Province',
                            xlabel='Lead per m3 of water (in milligrams)',
                            ylabel='Total incidence per 1,000 inhabitants',
                            label='Data', 
                            title='Lead in waste water over incidence of violent crime per Province')

Regression_line2 = hv.Slope.from_scatter(scatter_data).opts(color='red')

(scatter * Regression_line * Regression_line2 ).opts(legend_position='top_right')
Out[39]:

Finally, I want to plot the incidence and lead per m3 of wastewater per Province, over a map of the Netherlands.

In [40]:
# Plotting on a map

def get_map():
    m = folium.Map(location=[52.2594406805025, 4.952074743739346],
                   zoom_start=7)
    
    url = 'https://www.webuildinternet.com/articles/2015-07-19-geojson-data-of-the-netherlands/provinces.geojson'

    borderstyle = {
        'color' : 'black',
        'fill' : False,
        'weight' : 1
    }
    
    border = folium.GeoJson(data=url, name='borders', 
                             style_function= lambda x: borderstyle, 
                             control = False).add_to(m)

    incidence = folium.Choropleth(
        geo_data=url,
        name="Incidence",
        data=lead_crime_df.replace('Fryslân', 'Friesland (Fryslân)'),
        columns=["Province", "Total_incidence"],
        key_on="properties.name", # Thanks to job for helping me figure out wich value to use here!
        fill_color="YlOrRd",
        fill_opacity=0.7,
        line_opacity=1,
        legend_name="Total violent crime incidence (per 100.000 inhabitants)",
        show = False
    ).add_to(m)

    lead = folium.Choropleth(
            geo_data=url,
            name="Lead",
            data=lead_crime_df.replace('Fryslân', 'Friesland (Fryslân)'),
            columns=["Province", "lead_per_m3"],
            key_on="properties.name",
            fill_color="BuPu",
            fill_opacity=0.7,
            line_opacity=1,
            legend_name="Total violent crime incidence (per 100.000 inhabitants)",
            show = False,
            hilight = True
    ).add_to(m)

    minimap = plugins.MiniMap(toggle_display=True)
    m.add_child(minimap)

    folium.LayerControl(collapsed=False).add_to(m)

    return m

mapobject = get_map()
mapobject.save('Lead & Crime incidence NL.html')

mapobject
Out[40]:

Conclusion: There is a weak but significant correlation between lead found in wastewater and violent crimes when looking at the Netherlands as a whole. Per province, however, the correlation between lead in influent wastewater and violent crime incidence varies greatly, from -0.445 to 0.745, so looking at all the data I suspect this weak correlation does not indicate causation.
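Since each province contributes only 11 yearly observations, those per-province correlations carry wide uncertainty. A sketch of how per-province p values could be inspected (synthetic stand-in data with the same column names; the real lead_crime_df would be grouped the same way):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Stand-in for lead_crime_df: 11 years of data for two provinces
demo = pd.DataFrame({
    'Province': ['Groningen'] * 11 + ['Limburg'] * 11,
    'Year': list(range(2010, 2021)) * 2,
    'lead_per_m3': rng.uniform(5, 38, 22),
    'Total_incidence': rng.uniform(8, 21, 22),
})

# Spearman rho and p value per province
for prov, grp in demo.groupby('Province'):
    rho, p = spearmanr(grp['lead_per_m3'], grp['Total_incidence'])
    print(f"{prov}: rho = {rho:+.3f}, p = {p:.3f}")
```

With n = 11, |rho| needs to be roughly 0.6 or higher before p drops below 0.05, so the per-province values reported above should be read as noisy estimates.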